In-class Exercise 5 - Geographically Weighted Logistic Regression (GWLR) and Application

Overview

In this lesson, you will learn the basic concepts and methods of logistic regression specially designed for geographical data. Upon completion of this lesson, you will able to:

  • Explain the similarities and differences between Logistic Regression (LR) algorithm versus Geographical weighted Logistic Regression (GWLR) algorithm.

  • Calibrate predictive models by using appropriate Geographically Weighted Logistic Regression algorithm for geographical data.

Content

  • Basic concepts and principles of Logistic Regression

  • Geographically Weighted Logistic Regression methods

    • Weighting functions (kernel)

    • Weighting schemes

    • Bandwidth

    • Interpreting and Visualizing

Notes on Logistics Regression

  • Relatively larger samples required

  • Ensure no mulit-collinearity

Model Variables

Dependent Variables: Water Point Status (i.e. Functional vs Non-Functional)

Independent Variables:

  • distance_to_primary_road;

  • distance_to_secondary_road;

  • distance_to_tertiary_road;

  • distance_to_city;

  • distance_to_town;

  • water_point_population;

  • local_population_1km;

  • usage_capacity;

  • is_urban;

  • water_source_clean

Packages Used

  • sf - for importing and handling geospatial data

  • tmap - for creating thematic maps, such as Choropleth maps

  • spdep - a collection of functions to create spatial weights matrix object

  • funModeling - a collection of functions related to exploratory data analysis, data preparation, and model performance

  • tidyverse -a collection of packages for performing data science tasks

  • The following tidyverse packages will be used:

    • readr for importing delimited text file

    • tidyr for manipulating and tidying data

    • dplyr for wrangling and transforming data

    • ggplot2 for data visualisation

  • corrplot - A graphical display of a correlation matrix, confidence interval.
  • ggpubr - provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.

  • blorr - Tools for building binary logistic regression models

  • GWmodel - Techniques from a particular branch of spatial statistics,termed geographically-weighted (GW) models. GW models suit situations when data are not described well by some global model, but where there are spatial regions where a suitably localised calibration provides a better description.

    • GW summary statistics

    • GW principal components analysis

    • GW discriminant analysis

    • GW regression

  • skimr - provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data.

  • Caret - Comprehensive framework for building machine learning models in R.

The code chunk to install and load the packages is shown below:

pacman::p_load(sf, tidyverse, funModeling, blorr,corrplot, ggpubr, spdep, GWmodel, tmap, skimr, caret)

Data Set

There are two rds data-sets we will be using:

  • Osun

  • Osun_wp_sf

Importing Data Files

Using read_rds, we import the “Osun” and “Osun_wp_sf” rds into R using the code chunk below:

Osun_wp_sf <- read_rds("data/rds/Osun_wp_sf.rds")
Osun <- read_rds("data/rds/Osun.rds")

Next, we will use glimpse to check on the imported data:

glimpse(Osun_wp_sf)
Rows: 4,760
Columns: 75
$ row_id                     <dbl> 70578, 66401, 65607, 68989, 67708, 66419, 6…
$ source                     <chr> "Federal Ministry of Water Resources, Niger…
$ lat_deg                    <dbl> 7.759448, 8.031187, 7.891137, 7.509588, 7.4…
$ lon_deg                    <dbl> 4.563998, 4.637400, 4.713438, 4.265002, 4.3…
$ report_date                <chr> "05/11/2015 12:00:00 AM", "04/30/2015 12:00…
$ status_id                  <chr> "No", "No", "No", "No", "Yes", "Yes", "Yes"…
$ water_source_clean         <chr> "Borehole", "Borehole", "Borehole", "Boreho…
$ water_source_category      <chr> "Well", "Well", "Well", "Well", "Well", "We…
$ water_tech_clean           <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ water_tech_category        <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ facility_type              <chr> "Improved", "Improved", "Improved", "Improv…
$ clean_country_name         <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ clean_adm1                 <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ clean_adm2                 <chr> "Osogbo", "Odo-Otin", "Boripe", "Ayedire", …
$ clean_adm3                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ clean_adm4                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ install_year               <dbl> NA, 2004, 2006, 2014, 2011, 2007, 2007, 200…
$ installer                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehab_year                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehabilitator              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management_clean           <chr> NA, NA, NA, NA, NA, "Community Management",…
$ status_clean               <chr> "Abandoned/Decommissioned", "Abandoned/Deco…
$ pay                        <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ fecal_coliform_presence    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fecal_coliform_value       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ subjective_quality         <chr> "Acceptable quality", "Acceptable quality",…
$ activity_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ scheme_id                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wpdx_id                    <chr> "6FV6QH57+QHH", "6FW62JJP+FXC", "6FV6VPR7+F…
$ notes                      <chr> "COSTAIN  area, Osogbo", "Igbaye", "Araromi…
$ orig_lnk                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ photo_lnk                  <chr> "https://akvoflow-55.s3.amazonaws.com/image…
$ country_id                 <chr> "NG", "NG", "NG", "NG", "NG", "NG", "NG", "…
$ data_lnk                   <chr> "https://catalog.waterpointdata.org/dataset…
$ distance_to_primary_road   <dbl> 167.82235, 4133.71306, 6559.76182, 8260.715…
$ distance_to_secondary_road <dbl> 838.9185, 1162.6246, 4625.0324, 5854.5731, …
$ distance_to_tertiary_road  <dbl> 1181.107236, 9.012647, 58.314987, 2013.6515…
$ distance_to_city           <dbl> 2449.2931, 16704.1923, 21516.8495, 31765.68…
$ distance_to_town           <dbl> 9463.295, 5176.899, 8589.103, 16386.459, 23…
$ water_point_history        <chr> "{\"2015-05-11\": {\"photo_lnk\": \"https:/…
$ rehab_priority             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ water_point_population     <dbl> NA, NA, NA, NA, 0, 508, 162, 362, 3562, 217…
$ local_population_1km       <dbl> NA, NA, NA, NA, 70, 647, 449, 1611, 7199, 2…
$ crucialness_score          <dbl> NA, NA, NA, NA, NA, 0.785162287, 0.36080178…
$ pressure_score             <dbl> NA, NA, NA, NA, NA, 1.69333333, 0.54000000,…
$ usage_capacity             <dbl> 1000, 1000, 1000, 300, 1000, 300, 300, 300,…
$ is_urban                   <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALS…
$ days_since_report          <dbl> 2689, 2700, 2688, 2693, 2701, 2692, 2681, 2…
$ staleness_score            <dbl> 42.84384, 42.69554, 42.85735, 42.78986, 42.…
$ latest_record              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ location_id                <dbl> 239556, 230405, 240009, 236400, 229231, 237…
$ cluster_size               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ clean_country_id           <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "…
$ country_name               <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ water_source               <chr> "Improved Tube well or borehole", "Improved…
$ water_tech                 <chr> "Motorised", "Motorised", "Motorised", "yet…
$ status                     <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ adm2                       <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ adm3                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management                 <chr> NA, NA, NA, NA, NA, "Community Management",…
$ adm1                       <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ `New Georeferenced Column` <chr> "POINT (4.5639983 7.7594483)", "POINT (4.63…
$ lat_deg_original           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lat_lon_deg                <chr> "(7.7594483°, 4.5639983°)", "(8.0311867°, 4…
$ lon_deg_original           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ public_data_source         <chr> "https://catalog.waterpointdata.org/datafil…
$ converted                  <chr> "#status_id, #water_source, #pay, #status, …
$ count                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ created_timestamp          <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ updated_timestamp          <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ Geometry                   <POINT [m]> POINT (236239.7 417577), POINT (24463…
$ ADM2_EN                    <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ ADM2_PCODE                 <chr> "NG030030", "NG030025", "NG030006", "NG0300…
$ ADM1_EN                    <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE                 <chr> "NG030", "NG030", "NG030", "NG030", "NG030"…
glimpse(Osun)
Rows: 30
Columns: 5
$ ADM2_EN    <chr> "Aiyedade", "Aiyedire", "Atakumosa East", "Atakumosa West",…
$ ADM2_PCODE <chr> "NG030001", "NG030002", "NG030003", "NG030004", "NG030005",…
$ ADM1_EN    <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE <chr> "NG030", "NG030", "NG030", "NG030", "NG030", "NG030", "NG03…
$ geometry   <MULTIPOLYGON [m]> MULTIPOLYGON (((213526.6 34..., MULTIPOLYGON (…

Exploratory Data Analysis

Count and percentage of water points by Functional Status

We will plot the count and percentage of water points by Functional Status - True and False:

ggplot(data= Osun_wp_sf, 
       aes(x = status)) +
  geom_bar(aes(fill = status), show.legend = TRUE) +
  ylim(0,3000)+
  geom_text(stat = 'count',
           aes(label= paste0(after_stat(count), ', ', 
                            round(after_stat(count)/
                            sum(after_stat(count))*100, 2), '%')), 
                            vjust= -0.5, size= 5) +
  labs( y = "Count of Water Points", x = "Status of Water Points",
        title = "Water Point by Functional Status") +
  theme(text = element_text(size = 10), 
        axis.ticks.x= element_blank(), 
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1, size=10))

Plot Map of water points by Functional Status

tmap_mode("view") 
tmap mode set to interactive viewing
tm_shape(Osun) +
tmap_options(check.and.fix = TRUE) + 
               tm_polygons(alpha = 0.4) +
tm_shape(Osun_wp_sf) +
  tm_dots(col = "status",
          alpha = 0.6)+
  tm_view(set.zoom.limits = c(9,12))

Summary Statistics with skimr

Osun_wp_sf %>% 
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

Data Wrangling

  • Excluded missing values

  • Recoded data type of usage capacity and convert to Factor rather than keeping it as numeric values. Converting to factor helps to derive dummy variable.

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))

Correlation Analysis

Drop the geometry columns from SF data frame:

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57)) %>%
  st_set_geometry(NULL)
cluster_vars.cor = cor(
  Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black", number.cex=1.1, tl.cex=1.5)
Warning in corrplot(corr, type = "upper", method = upper, diag = TRUE, tl.pos =
tl.pos, : Not been able to calculate text margin, please try again with a clean
new empty window using {plot.new(); dev.off()} or reduce tl.cex

Interpretation of Correlation Plot

From the correlation plot result above, we can conclude that there is no multi-collinearity. There is no highly correlated variables, defined as correlation coefficient, r >= +/-0.8.

Building a Logistic Regression Model

Build a First Prototype of Logistic Regression Model

We will use all the model variables listed to build our first version of the logistic regression model. We will use the code chunk below:

modelv0 <- glm(status~ distance_to_primary_road+
                 distance_to_secondary_road+
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))

Instead of using typical R report, we use the blr_regress() function from blorr. It will help generate a better report.

blr_regress(modelv0)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

Interpretation of Logistic Regression

Screen through P-Values of Variables first

To exclude the following variables since P-Value is >0.05:

  • distance_to_primary_road

  • distance_to_secondary_road

Assess Categorical Variables:

  • Positive values implies an above average correlation

    • water_source_cleanProtected Shallow Well

    • iwater_source_cleanProtected Spring

  • Negative values implies an below average correlation

    • usage_capacity1000

    • is_urbanTRUE

Assess Continuous Variables:

  • Positive values implies an direct correlation

  • Negative values implies an inverse correlation

  • Magnitude of the Value gives the strength of correlation

blr_confusion_matrix(modelv0, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

Refine Logistics Regression Model

Since the previous model showed that distance_to_primary_road and distance_to_secondary_road were variables to exclude, we refined the logistic regression model by removing them from the original model. We run the refined model using the following code chunk:

model <- glm(status~ distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))
blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------

Interpretation of Revised Model

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

Notice, all variables now have P-Value less than 0.05.

Building Geographically Weighted Regression Model

Converting from SF to SP data frame

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean)) %>%
  as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,           1000,        0,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,            300,        1,   Protected Spring 
# Notice - No need to drop geometry column as you are converting one data set with geometrical features to another with geometrical features

# Spatial point data frame

Building a Fixed Bandwidth GWR model

bw.fixed <- bw.ggwr(status ~
                  distance_to_primary_road+
                  distance_to_secondary_road+
                  distance_to_tertiary_road+
                  distance_to_city+
                  distance_to_town+
                  water_point_population+
                  local_population_1km+
                  usage_capacity+
                  is_urban+
                  water_source_clean,
                data = Osun_wp_sp,
                family = "binomial",
                approach =  "AIC",
                kernel = "gaussian",
                adaptive = FALSE,
                longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2889 
       1        -2836 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5684.357 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2875 
       1        -2818 
       2        -2810 
       3        -2808 
       4        -2808 
       5        -2808 
Fixed bandwidth: 59200.13 AICc value: 5646.785 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2847 
       1        -2781 
       2        -2768 
       3        -2765 
       4        -2765 
       5        -2765 
       6        -2765 
Fixed bandwidth: 36599.53 AICc value: 5575.148 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2798 
       1        -2719 
       2        -2698 
       3        -2693 
       4        -2693 
       5        -2693 
       6        -2693 
Fixed bandwidth: 22631.59 AICc value: 5466.883 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2720 
       1        -2622 
       2        -2590 
       3        -2581 
       4        -2580 
       5        -2580 
       6        -2580 
       7        -2580 
Fixed bandwidth: 13998.93 AICc value: 5324.578 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2601 
       1        -2476 
       2        -2431 
       3        -2419 
       4        -2417 
       5        -2417 
       6        -2417 
       7        -2417 
Fixed bandwidth: 8663.649 AICc value: 5163.61 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2436 
       1        -2268 
       2        -2194 
       3        -2167 
       4        -2161 
       5        -2161 
       6        -2161 
       7        -2161 
       8        -2161 
       9        -2161 
Fixed bandwidth: 5366.266 AICc value: 4990.587 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2157 
       1        -1922 
       2        -1802 
       3        -1739 
       4        -1713 
       5        -1713 
Fixed bandwidth: 3328.371 AICc value: 4798.288 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1751 
       1        -1421 
       2        -1238 
       3        -1133 
       4        -1084 
       5        -1084 
Fixed bandwidth: 2068.882 AICc value: 4837.017 
 Iteration    Log-Likelihood:(With bandwidth:  4106.777 )
=========================
       0        -2297 
       1        -2095 
       2        -1997 
       3        -1951 
       4        -1938 
       5        -1936 
       6        -1936 
       7        -1936 
       8        -1936 
Fixed bandwidth: 4106.777 AICc value: 4873.161 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2036 
       1        -1771 
       2        -1633 
       3        -1558 
       4        -1525 
       5        -1525 
Fixed bandwidth: 2847.289 AICc value: 4768.192 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -1941 
       1        -1655 
       2        -1503 
       3        -1417 
       4        -1378 
       5        -1378 
Fixed bandwidth: 2549.964 AICc value: 4762.212 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1874 
       1        -1573 
       2        -1410 
       3        -1316 
       4        -1274 
       5        -1274 
Fixed bandwidth: 2366.207 AICc value: 4773.081 
 Iteration    Log-Likelihood:(With bandwidth:  2663.532 )
=========================
       0        -1979 
       1        -1702 
       2        -1555 
       3        -1474 
       4        -1438 
       5        -1438 
Fixed bandwidth: 2663.532 AICc value: 4762.568 
 Iteration    Log-Likelihood:(With bandwidth:  2479.775 )
=========================
       0        -1917 
       1        -1625 
       2        -1468 
       3        -1380 
       4        -1339 
       5        -1339 
Fixed bandwidth: 2479.775 AICc value: 4764.294 
 Iteration    Log-Likelihood:(With bandwidth:  2593.343 )
=========================
       0        -1956 
       1        -1674 
       2        -1523 
       3        -1439 
       4        -1401 
       5        -1401 
Fixed bandwidth: 2593.343 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2620.153 )
=========================
       0        -1965 
       1        -1685 
       2        -1536 
       3        -1453 
       4        -1415 
       5        -1415 
Fixed bandwidth: 2620.153 AICc value: 4761.89 
 Iteration    Log-Likelihood:(With bandwidth:  2576.774 )
=========================
       0        -1950 
       1        -1667 
       2        -1515 
       3        -1431 
       4        -1393 
       5        -1393 
Fixed bandwidth: 2576.774 AICc value: 4761.889 
 Iteration    Log-Likelihood:(With bandwidth:  2603.584 )
=========================
       0        -1960 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2603.584 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2609.913 )
=========================
       0        -1962 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1410 
       5        -1410 
Fixed bandwidth: 2609.913 AICc value: 4761.831 
 Iteration    Log-Likelihood:(With bandwidth:  2599.672 )
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Fixed bandwidth: 2599.672 AICc value: 4761.809 
 Iteration    Log-Likelihood:(With bandwidth:  2597.255 )
=========================
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
Fixed bandwidth: 2597.255 AICc value: 4761.809 
bw.fixed
[1] 2599.672
gwlr.fixed <- ggwr.basic(status ~
                  distance_to_primary_road+
                  distance_to_secondary_road+
                  distance_to_tertiary_road+
                  distance_to_city+
                  distance_to_town+
                  water_point_population+
                  local_population_1km+
                  usage_capacity+
                  is_urban+
                  water_source_clean,
                data = Osun_wp_sp,
                bw = bw.fixed,
                family = "binomial",
                kernel = "gaussian",
                adaptive = FALSE,
                longlat = FALSE)
Warning in proj4string(data): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Warning in proj4string(regression.points): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
 Iteration    Log-Likelihood
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
prefer_proj): Discarded ellps unknown in Proj4 definition: +proj=tmerc +lat_0=4
+lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +units=m
+no_defs +type=crs
Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
prefer_proj): Discarded datum unknown in Proj4 definition
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 12:33:25 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_tertiary_road + distance_to_city + distance_to_town + 
    water_point_population + local_population_1km + usage_capacity + 
    is_urban + water_source_clean, data = Osun_wp_sp, bw = bw.fixed, 
    family = "binomial", kernel = "gaussian", adaptive = FALSE, 
    longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km usage_capacity is_urban water_source_clean
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-124.555    -1.755     1.072     1.742    34.333  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.887e-01  1.124e-01   3.459 0.000543
distance_to_primary_road                 -4.642e-06  6.490e-06  -0.715 0.474422
distance_to_secondary_road               -5.143e-06  9.299e-06  -0.553 0.580230
distance_to_tertiary_road                 9.683e-05  2.073e-05   4.671 3.00e-06
distance_to_city                         -1.686e-05  3.544e-06  -4.757 1.96e-06
distance_to_town                         -1.480e-05  3.009e-06  -4.917 8.79e-07
water_point_population                   -5.097e-04  4.484e-05 -11.369  < 2e-16
local_population_1km                      3.451e-04  1.788e-05  19.295  < 2e-16
usage_capacity1000                       -6.230e-01  6.972e-02  -8.937  < 2e-16
is_urbanTRUE                             -2.971e-01  8.185e-02  -3.629 0.000284
water_source_cleanProtected Shallow Well  5.040e-01  8.574e-02   5.878 4.14e-09
water_source_cleanProtected Spring        1.288e+00  4.388e-01   2.936 0.003325
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
water_point_population                   ***
local_population_1km                     ***
usage_capacity1000                       ***
is_urbanTRUE                             ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.0  on 4744  degrees of freedom
AIC: 5712

Number of Fisher Scoring iterations: 5


 AICc:  5712.099
 Pseudo R-square value:  0.1295351
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -8.7228e+02 -4.9955e+00  1.7600e+00
   distance_to_primary_road                 -1.9389e-02 -4.8031e-04  2.9618e-05
   distance_to_secondary_road               -1.5921e-02 -3.7551e-04  1.2317e-04
   distance_to_tertiary_road                -1.5618e-02 -4.2368e-04  7.6179e-05
   distance_to_city                         -1.8416e-02 -5.6217e-04 -1.2726e-04
   distance_to_town                         -2.2411e-02 -5.7283e-04 -1.5155e-04
   water_point_population                   -5.2208e-02 -2.2767e-03 -9.8875e-04
   local_population_1km                     -1.2698e-01  4.9952e-04  1.0638e-03
   usage_capacity1000                       -2.0772e+01 -9.7231e-01 -4.1592e-01
   is_urbanTRUE                             -1.9790e+02 -4.2908e+00 -1.6864e+00
   water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01  5.3340e-01
   water_source_cleanProtected.Spring       -5.2235e+02 -5.5977e+00  2.5441e+00
                                                3rd Qu.      Max.
   Intercept                                 1.2763e+01 1073.2154
   distance_to_primary_road                  4.8443e-04    0.0142
   distance_to_secondary_road                6.0692e-04    0.0258
   distance_to_tertiary_road                 6.6814e-04    0.0128
   distance_to_city                          2.3718e-04    0.0150
   distance_to_town                          1.9271e-04    0.0224
   water_point_population                    5.0102e-04    0.1309
   local_population_1km                      1.8157e-03    0.0392
   usage_capacity1000                        3.0322e-01    5.9281
   is_urbanTRUE                              1.2841e+00  744.3097
   water_source_cleanProtected.Shallow.Well  1.7849e+00   67.6343
   water_source_cleanProtected.Spring        6.7663e+00  317.4123
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2795.084 
   AIC : 4414.606 
   AICc : 4747.423 
   Pseudo R-square value:  0.5722559 

   ***********************************************************************
   Program stops at: 2022-12-17 12:34:12 

At the global model (no geographical), no AICc. Hence, can only compare AIC.

Model Assessment

To assess the performance of the gwLR, firstly, we will convert the SDF object as data frame by using the code chunk below:

#SDF is a data frame. Write it out as a dataframe

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label yhat values greater or equal to 0.5 (Probability Threshold Cut-off) into 1 else 0. The result of the logic comparison operation will be saved into a field called most.

Create a new field called “most” using mutate

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5, T, F))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
                                          
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7642          
                                          
 Mcnemar's Test P-Value : 0.2689          
                                          
            Sensitivity : 0.8628          
            Specificity : 0.9005          
         Pos Pred Value : 0.8740          
         Neg Pred Value : 0.8913          
             Prevalence : 0.4445          
         Detection Rate : 0.3835          
   Detection Prevalence : 0.4388          
      Balanced Accuracy : 0.8816          
                                          
       'Positive' Class : FALSE           
                                          

Interpretation of Results

  • Improvement in AIC

  • Improvement in Accuracy, Sensitivity and Specificity, True Positive Prediction Value as seen in diagram below:

Visualization

Preparing data frame

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)

Visualizing Coefficient Estimates

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8,14))
prob_T